Explore and Summarize the Red Wine dataset

Overview

This report explores a dataset on red wine quality. The main objective is to identify which chemical properties are associated with the quality of the wine. The tidy dataset contains 1,599 observations and 13 variables, one of which (X) is a unique identifier.

The variables in the red wine dataset are:

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

The type of each variable:

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Two of the variables are integers: X and quality. X is the index of each entry, not a rating. The other variables are all numeric (decimals).

We will look at a summary of the data, omitting X as it does not factor in the rating of the wines.

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000
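The summary step can be sketched as follows. This is an illustration rather than the report's original code: the data frame `wine` is rebuilt from the first ten rows printed in the `str()` output, with only two measurement columns for brevity, so the numbers it prints will not match the full-data summary.

```r
# Illustrative subset: first ten rows of two columns, copied from the str() output.
wine <- data.frame(
  X       = 1:10,
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Drop the index column X before summarizing; it is a row id, not a measurement.
wine_no_id <- wine[, setdiff(names(wine), "X"), drop = FALSE]

# Column classes: alcohol is numeric, quality is integer.
col_classes <- sapply(wine_no_id, class)
print(col_classes)

# Min, quartiles, median, mean and max for every remaining column.
print(summary(wine_no_id))
```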

Univariate Plots

Let’s melt the data, and visualize it in a boxplot, omitting the index X.
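Melting means reshaping the wide table into long format, with one row per (variable, value) pair, so that every variable can be drawn in its own panel. The sketch below uses base R's `stack()` as a stand-in for a melt function; the ten-row `wine` data frame is copied from the `str()` output and the column names `value` and `variable` are chosen here for illustration.

```r
# Illustrative subset: first ten rows of three columns, copied from the str() output.
wine <- data.frame(
  X              = 1:10,
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  pH             = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol        = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Wide to long: one row per (variable, value) pair, with the index X left out.
wine_long <- stack(wine[, setdiff(names(wine), "X")])
names(wine_long) <- c("value", "variable")

# Compute the box statistics per variable without drawing anything.
box_stats <- boxplot(value ~ variable, data = wine_long, plot = FALSE)
print(box_stats$out)  # points beyond the whiskers

# Draw the boxplots only when running interactively.
if (interactive()) {
  boxplot(value ~ variable, data = wine_long)
}
```

Because the variables sit on very different scales, separate panels per variable (for example facets) are usually easier to read than one shared axis.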

We can also use histograms to help understand the distributions better.

Most of these variables look roughly bell-shaped, though the summary shows that several have a long right tail. Chlorides and residual sugar need a closer look, however. Let’s exclude the values above the 95th percentile for these fields and re-plot them.

## Warning: Removed 79 rows containing non-finite values (stat_bin).
## Warning: Removed 80 rows containing non-finite values (stat_bin).

With those values excluded, these fields also appear to be approximately normally distributed.
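A minimal sketch of the percentile cut, assuming the cutoff is applied per variable with `quantile()`. It uses the ten residual sugar values from the `str()` output, so the count removed here is not the count from the full data.

```r
# First ten residual.sugar values, copied from the str() output.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# 95th percentile (R's default interpolation, type 7).
cutoff <- quantile(residual_sugar, probs = 0.95)

# Keep only values at or below the cutoff.
trimmed <- residual_sugar[residual_sugar <= cutoff]
n_removed <- length(residual_sugar) - length(trimmed)

print(cutoff)
print(n_removed)
```

On the full dataset a 95th-percentile cut drops about 5% of the 1,599 rows, roughly 80, which is in line with the 79 and 80 rows reported in the warnings.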

Here is a summary of residual.sugar:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

Here is a summary of chlorides:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Quality is an important factor in determining wine selection. Let’s take a deeper look.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

The most common ratings are 5 and 6, in that order.
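Counting how often each rating occurs takes one call to `table()`. The sketch below runs on the first ten quality values from the `str()` output only, so the counts are illustrative.

```r
# First ten quality ratings, copied from the str() output.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Frequency of each rating, most frequent first.
counts <- sort(table(quality), decreasing = TRUE)
print(counts)

# The rating that occurs most often in this subset.
most_common <- names(counts)[1]
print(most_common)
```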

Alcohol is another important variable.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

An alcohol content of about 10 is the most common, followed by about 9.

Univariate Analysis

Most of the fields are approximately normally distributed. Residual.sugar and chlorides appear approximately normal once the values above the 95th percentile are removed. The most common ratings for quality are 5 and 6. The most common values for alcohol are about 10, 9, and 11. We will explore this in more depth as we go, but I suspect alcohol content is a good indicator of quality.
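One rough way to check how symmetric each distribution is, using only numbers from the summary table, is to compare the mean with the median and scale the gap by the interquartile range. This is a heuristic chosen here for illustration, not a formal normality test; a gap near zero is consistent with symmetry, and a large positive gap suggests a long right tail.

```r
# Quartiles, medians and means copied from the summary table.
stats <- data.frame(
  variable = c("fixed.acidity", "residual.sugar", "chlorides",
               "density", "pH", "alcohol"),
  q1     = c(7.10, 1.900, 0.07000, 0.9956, 3.210,  9.50),
  median = c(7.90, 2.200, 0.07900, 0.9968, 3.310, 10.20),
  mean   = c(8.32, 2.539, 0.08747, 0.9967, 3.311, 10.42),
  q3     = c(9.20, 2.600, 0.09000, 0.9978, 3.400, 11.10),
  stringsAsFactors = FALSE
)

# (mean - median) / IQR: a scale-free indicator of skew.
stats$gap <- (stats$mean - stats$median) / (stats$q3 - stats$q1)

# Most right-skewed variables first.
stats <- stats[order(stats$gap, decreasing = TRUE), ]
print(stats[, c("variable", "gap")], row.names = FALSE)
```

Residual sugar and chlorides come out on top by this measure, while pH and density are close to zero.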

Bivariate Plots

We can visualize the relationship between each pair of variables in a plot matrix and read off the correlation coefficient for each pair.

The four highest correlation coefficients with quality are:

  1. alcohol at 0.48
  2. sulphates at 0.25
  3. citric.acid at 0.23
  4. fixed.acidity at 0.12

Alcohol content has the strongest correlation with red wine quality, a moderate positive one at 0.48. Other attributes positively correlated with quality, though more weakly, include sulphates, citric acid and fixed acidity.

The four strongest negative correlation coefficients with quality are:

  1. volatile.acidity at -0.39
  2. total.sulfur.dioxide at -0.19
  3. density at -0.17
  4. chlorides at -0.13

Volatile acids, total sulfur dioxide, density and chlorides are all negatively correlated with quality.
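The coefficients listed above are Pearson correlations with quality. A sketch of how such a ranking can be computed with base R's `cor()` follows. It runs on the first ten rows from the `str()` output and on three predictors only, so its values will differ from the full-data coefficients reported in this section.

```r
# Illustrative subset: first ten rows of four columns, copied from the str() output.
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Pearson correlation matrix; keep only the column for quality.
r_quality <- cor(wine)[, "quality"]

# Drop the trivial correlation of quality with itself.
r_quality <- r_quality[names(r_quality) != "quality"]

# Largest positive coefficient first, most negative last.
print(round(sort(r_quality, decreasing = TRUE), 2))
```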

The highest correlations, both positive and negative, include:

  • fixed.acidity to citric.acid at 0.67
  • fixed.acidity to density at 0.67
  • free.sulfur.dioxide to total.sulfur.dioxide at 0.67
  • alcohol to quality at 0.48
  • density to alcohol at -0.50
  • citric.acid to pH at -0.54
  • volatile.acidity to citric.acid at -0.55
  • fixed.acidity to pH at -0.68
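A list like the one above can be pulled out of a correlation matrix by taking its upper triangle and sorting by absolute value. The sketch below assumes base R only; the data frame is the ten-row excerpt from the `str()` output with four columns, and the name `ranked` is introduced here for illustration.

```r
# Illustrative subset: first ten rows of four columns, copied from the str() output.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

cm <- cor(wine)

# Each pair appears twice in the symmetric matrix; keep the upper triangle only.
upper <- upper.tri(cm)
idx <- which(upper, arr.ind = TRUE)

ranked <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[upper],
  stringsAsFactors = FALSE
)

# Strongest relationships first, regardless of sign.
ranked <- ranked[order(abs(ranked$r), decreasing = TRUE), ]
print(ranked, row.names = FALSE)
```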

We will take a more in-depth look at density and alcohol:

There are few observations at the highest and lowest alcohol levels, but overall there is a trend towards higher density as alcohol content drops.

Let’s look at fixed acidity and pH:

We see fixed acidity increase as pH decreases.
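The direction of this trend can be checked with a simple linear fit. The sketch below uses `lm()` on the ten rows from the `str()` output; it is an illustration of the idea, and the slope from this small excerpt is not the full-data estimate.

```r
# Illustrative subset: first ten rows of two columns, copied from the str() output.
wine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

# Least-squares line for pH as a function of fixed acidity.
fit <- lm(pH ~ fixed.acidity, data = wine)
slope <- coef(fit)[["fixed.acidity"]]
print(slope)  # a negative slope means pH falls as fixed acidity rises

# Scatter plot with the fitted line, drawn only when running interactively.
if (interactive()) {
  plot(pH ~ fixed.acidity, data = wine)
  abline(fit)
}
```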

Let’s look at fixed acidity and density:

Fixed acidity increases as density increases.

Bivariate Analysis

Alcohol has the highest positive correlation with quality, followed by sulphates, citric.acid, and fixed.acidity. Volatile.acidity has the strongest negative correlation, followed by total.sulfur.dioxide, density, and chlorides. We explored this further by comparing alcohol and density. Density rises as alcohol drops, which is to be expected given their negative correlation (-0.50). It’s the same with fixed acidity and pH, which have the strongest negative correlation coefficient among our fields (-0.68). Density and fixed acidity correlate positively (0.67) and trend in the same direction: wines with higher density tend to have higher fixed acidity.

Multivariate Plots

Let’s look at the alcohol content by red wine quality using a density plot:

As the correlations suggest, higher alcohol content goes with higher quality. The exception appears to be red wines with a quality rating of 5, whose median alcohol content (9.7) is lower than that of quality 3 and 4.
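A per-quality density plot amounts to one kernel density estimate per rating. The sketch below uses base R's `density()` on the ten rows from the `str()` output; it is a stand-in for whatever plotting function the report used, and ratings with a single observation are skipped because the default bandwidth rule needs at least two points.

```r
# Illustrative subset: first ten rows of two columns, copied from the str() output.
wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Alcohol values grouped by quality rating.
groups <- split(wine$alcohol, wine$quality)

# The default bandwidth selector needs at least two points per group.
groups <- groups[lengths(groups) >= 2]

# One kernel density estimate per remaining rating.
dens <- lapply(groups, density)
print(names(dens))

# Overlay the curves only when running interactively.
if (interactive()) {
  x_range <- range(sapply(dens, function(d) range(d$x)))
  y_max <- max(sapply(dens, function(d) max(d$y)))
  plot(NULL, xlim = x_range, ylim = c(0, y_max),
       xlab = "alcohol", ylab = "density")
  for (i in seq_along(dens)) lines(dens[[i]], col = i)
  legend("topright", legend = names(dens), col = seq_along(dens), lty = 1)
}
```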

Here are the summary statistics for alcohol content at each quality level:

## factor(wine$quality): 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.575  11.000 
## -------------------------------------------------------- 
## factor(wine$quality): 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## factor(wine$quality): 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## factor(wine$quality): 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## factor(wine$quality): 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## factor(wine$quality): 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00
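The header `factor(wine$quality): 3` in the output above is what base R's `by()` prints when the grouping argument is `factor(wine$quality)`, so the call was probably close to the sketch below. That is an inference from the output layout, not something the report states. The data frame is the ten-row excerpt from the `str()` output, which contains ratings 5, 6 and 7 only.

```r
# Illustrative subset: first ten rows of two columns, copied from the str() output.
wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# summary() of alcohol within each quality level.
alcohol_by_quality <- by(wine$alcohol, factor(wine$quality), summary)
print(alcohol_by_quality)

# A single statistic per group is easier to compare across levels.
median_by_quality <- tapply(wine$alcohol, wine$quality, median)
print(median_by_quality)
```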

It appears that sulphate content is quite important for red wine quality, particularly at the highest quality levels, 7 and 8.

And here are the summary statistics for sulphates at each quality level:

## $`3`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5125  0.5450  0.5700  0.6150  0.8600 
## 
## $`4`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.4900  0.5600  0.5964  0.6000  2.0000 
## 
## $`5`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.370   0.530   0.580   0.621   0.660   1.980 
## 
## $`6`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5800  0.6400  0.6753  0.7500  1.9500 
## 
## $`7`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7413  0.8300  1.3600 
## 
## $`8`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6300  0.6900  0.7400  0.7678  0.8200  1.1000

Let’s look at the relationship between sulphates, volatile.acidity, and quality:

Higher quality red wines tend to be concentrated in the top left of the plot.

Again, higher quality red wines tend to be concentrated in the top left of the plot.

Let’s summarize quality using a contour plot of alcohol and sulphate content:

Higher quality red wines are generally located near the upper right of the scatter plot, and lower quality red wines are generally located in the bottom right.

We’ll create a similar plot, but quality will be visualized using density plots along the x and y axes:

Multivariate Summary

As with previous explorations, we can see that alcohol content is a big factor in the quality of the red wine. Lower volatile acidity and higher sulphates also seem to correlate with higher quality.

Final Plots & Summary

The strongest correlation with quality was that of alcohol. We’ll examine the alcohol content by quality using a density plot:

Density plots for higher quality red wines are shifted to the right, meaning these wines tend to have a higher alcohol content than the lower quality red wines. The exception to this trend appears to be red wines with a quality rating of 5.

Let’s look at a summary of alcohol content at each quality level:

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.575  11.000 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

Sulphates were found to correlate positively with red wine quality (r = 0.25), while volatile acidity had a negative correlation (r = -0.39). We can visualize the relationship between these two variables, along with alcohol content and red wine quality, using a scatter plot:

We see a clear trend where higher quality red wines are concentrated in the lower left of the plot.

The points are spread fairly evenly along the x axis and sit towards the bottom of the y axis.

Here is a summary of sulphate content by quality:

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5125  0.5450  0.5700  0.6150  0.8600 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.4900  0.5600  0.5964  0.6000  2.0000 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.370   0.530   0.580   0.621   0.660   1.980 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5800  0.6400  0.6753  0.7500  1.9500 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7413  0.8300  1.3600 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6300  0.6900  0.7400  0.7678  0.8200  1.1000

And by volatile.acidity:

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4400  0.6475  0.8450  0.8845  1.0100  1.5800 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.230   0.530   0.670   0.694   0.870   1.130 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.180   0.460   0.580   0.577   0.670   1.330 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.3800  0.4900  0.4975  0.6000  1.0400 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4039  0.4850  0.9150 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2600  0.3350  0.3700  0.4233  0.4725  0.8500
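The direction of each trend can be read off the group medians in the tables above. The sketch below collects those medians into a small data frame and computes a Spearman rank correlation against the quality level. This is a check on six summary values per variable, chosen here for illustration, and it does not replace the correlations computed on the raw data.

```r
# Median of each variable at quality levels 3 to 8, copied from the tables above.
medians <- data.frame(
  quality          = 3:8,
  alcohol          = c(9.925, 10.00, 9.7, 10.50, 11.50, 12.15),
  sulphates        = c(0.545, 0.56, 0.58, 0.64, 0.74, 0.74),
  volatile.acidity = c(0.845, 0.67, 0.58, 0.49, 0.37, 0.37)
)

# Rank correlation of each median series with the quality level.
trend <- sapply(medians[, -1], function(m) {
  cor(medians$quality, m, method = "spearman")
})
print(round(trend, 2))

# Step-by-step change in median alcohol from one quality level to the next.
alcohol_steps <- diff(medians$alcohol)
print(alcohol_steps)
```

Median alcohol and sulphates rise with quality and median volatile acidity falls. The one negative step in `alcohol_steps` is the dip at quality 5.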

We can also visualize the relationship between alcohol content and sulphates by combining scatter plots with density plots:

Reflection

I opted to use the red wine dataset as it was recommended for the project. I think the biggest challenge was familiarizing myself with both R and RStudio. With the help of Google and (mostly) Stack Overflow I was able to get up to speed. At that point, I did not find the dataset particularly daunting.

I analyzed the relationship of a number of attributes to the quality ratings. Melting the data and using facet grids was helpful for visualizing the distribution of each of the variables with boxplots and histograms. GGally was helpful as it provided concise summaries of the paired relationships. Density plots were helpful in exploring the correlations I found from the paired plots. Once I had this plotted, it was interesting to build up the multivariate scatter and density plots to visualize the relationship of different variables with quality.

One step we could take next would be to analyze other wine datasets like the white wine set. Do the trends we found here carry over to a different wine type? That would be interesting to research.

Another step would be to incorporate machine learning techniques to build a predictive model. That would probably call for a larger dataset. With so many properties being measured, the interplay between them could be well suited to machine learning.